Is the Sample Good Enough? Comparing Data from Twitter's Streaming API with Twitter's Firehose

Authors

Fred Morstatter

Jürgen Pfeffer

Huan Liu

Kathleen M. Carley

Abstract

Twitter is a social media giant famous for the exchange of short, 140-character messages called “tweets”. In the scientific community, the microblogging site is known for openness in sharing its data. It provides a glance into its millions of users and billions of tweets through a “Streaming API” which provides a sample of all tweets matching some parameters preset by the API user. The API service has been used by many researchers, companies, and governmental institutions that want to extract knowledge in accordance with a diverse array of questions pertaining to social media. The essential drawback of the Twitter API is the lack of documentation concerning what and how much data users get. This leads researchers to question whether the sampled data is a valid representation of the overall activity on Twitter. In this work we embark on answering this question by comparing data collected using Twitter’s sampled API service with data collected using the full, albeit costly, Firehose stream that includes every single published tweet. We compare both datasets using common statistical metrics as well as metrics that allow us to compare topics, networks, and locations of tweets. The results of our work will help researchers and practitioners understand the implications of using the Streaming API.
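As a rough illustration of what comparing a sampled stream against a full stream can look like, the sketch below computes two simple measures: the fraction of full-stream tweets that the sample captured, and a rank correlation (Kendall's tau-a) over the most frequent hashtags. This is a minimal sketch under stated assumptions, not the authors' procedure: the abstract does not name the specific metrics, and the function names, the whitespace-based hashtag extraction, and the `top_n` parameter are all hypothetical choices made for the example.

```python
from collections import Counter
from itertools import combinations


def hashtag_counts(tweets):
    """Count hashtags (case-insensitive) in an iterable of tweet texts.

    Assumption: a hashtag is any whitespace-separated token starting with '#'.
    """
    counts = Counter()
    for text in tweets:
        for token in text.split():
            if token.startswith("#") and len(token) > 1:
                counts[token.lower()] += 1
    return counts


def coverage(sample_ids, full_ids):
    """Fraction of tweet IDs in the full dataset that also appear in the sample."""
    full = set(full_ids)
    if not full:
        return 0.0
    return len(full & set(sample_ids)) / len(full)


def kendall_tau(full_counts, sample_counts, top_n=10):
    """Kendall's tau-a between full and sample counts.

    Computed over the top_n hashtags of the full dataset; tied pairs
    contribute zero to the numerator.
    """
    top = [tag for tag, _ in full_counts.most_common(top_n)]
    concordant = discordant = 0
    for a, b in combinations(top, 2):
        product = (full_counts[a] - full_counts[b]) * (
            sample_counts[a] - sample_counts[b]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(top) * (len(top) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


# Toy data standing in for a full stream and a sampled stream (hypothetical).
full_texts = ["#a x", "#a #b", "#a #b", "#c", "#a", "#b"]
sample_texts = ["#a #b", "#c", "#b"]

full_counts = hashtag_counts(full_texts)
sample_counts = hashtag_counts(sample_texts)

print(coverage([1, 2], [1, 2, 3, 4]))
print(kendall_tau(full_counts, sample_counts))
```

A tau near 1 would suggest the sample preserves the ordering of popular hashtags, while values near 0 or below would suggest the sample distorts it. Any threshold for "good enough" is a judgment call that this sketch does not settle.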

Related Papers

Two 1%s Don't Make a Whole: Comparing Simultaneous Samples from Twitter's Streaming API

In the present work, we compare samples of tweets from the Twitter Streaming API that were constructed from different connections tracking the same popular keywords at the same time. We find that tweets from the Streaming API are not sampled at random; rather, on average over 96% of the tweets seen in one sample are seen in all others. Somewhat surprisingly, however, tweets found only in a subs...

Methods for Coding Tobacco-Related Twitter Data: A Systematic Review

BACKGROUND As Twitter has grown in popularity to 313 million monthly active users, researchers have increasingly been using it as a data source for tobacco-related research. OBJECTIVE The objective of this systematic review was to assess the methodological approaches of categorically coded tobacco Twitter data and make recommendations for future studies. METHODS Data sources included PsycIN...

Design and Test of the Real-time Text mining dashboard for Twitter

One of today's major research trends in the field of information systems is the discovery of implicit knowledge hidden in datasets that are currently being produced at high speed, in large volumes, and in a wide variety of formats. Data with such features are called big data. Extracting, processing, and visualizing this huge amount of data has become one of the concerns of data science scholar...

"Time for dabs": Analyzing Twitter data on marijuana concentrates across the U.S.

AIMS Media reports suggest increasing popularity of marijuana concentrates ("dabs"; "earwax"; "budder"; "shatter"; "butane hash oil") that are typically vaporized and inhaled via a bong, vaporizer or electronic cigarette. However, data on the epidemiology of marijuana concentrate use remain limited. This study aims to explore Twitter data on marijuana concentrate use in the U.S. and identify dif...

A Case Study of the New York City 2012-2013 Influenza Season With Daily Geocoded Twitter Data From Temporal and Spatiotemporal Perspectives

BACKGROUND Twitter has shown some usefulness in predicting influenza cases on a weekly basis in multiple countries and on different geographic scales. Recently, Broniatowski and colleagues suggested Twitter's relevance at the city-level for New York City. Here, we look to dive deeper into the case of New York City by analyzing daily Twitter data from temporal and spatiotemporal perspectives. Al...

Journal:

CoRR

Volume abs/1306.5204

Published 2013
